
[https://nvbugs/6225775][fix] Fix spec count graph #15212

Open
chuangz0 wants to merge 2 commits into NVIDIA:feat/bench_x from chuangz0:fix/spec-count-graph


@chuangz0

Collaborator

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

dongfengy and others added 2 commits May 29, 2026 02:39
…on (NVIDIA#14537)

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
CUDA graph warmup can capture speculative sampling without the generated-token-count frequency-penalty path when the warmup requests have no frequency penalty. Later RWLT GPT-OSS disagg requests replay that graph with frequency_penalty and prompt_ignore_length set, so repeated generated tokens are not penalized.
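The failure mode follows from how graph capture works: host-side control flow is evaluated once, at capture time, and only the kernels launched during capture are replayed. The toy model below is a hypothetical illustration, not TRT-LLM code; `capture_sampler`, `argmax`, and the OpenAI-style penalty formula (subtract `penalty * count`) are assumptions, and `prompt_ignore_length` is not modelled.

```python
from typing import Callable, List, Sequence

Logits = List[float]


def capture_sampler(warmup_penalty: float) -> Callable[[Sequence[float], Sequence[int], float], Logits]:
    """Toy stand-in for CUDA graph capture (hypothetical, not TRT-LLM code).

    The Python `if` runs once, at capture time, like a host-side branch
    around kernel launches during stream capture: only the chosen path
    is ever replayed.
    """
    if warmup_penalty != 0.0:
        def replay(logits, counts, penalty):
            # Count-based frequency penalty was captured.
            return [l - penalty * c for l, c in zip(logits, counts)]
    else:
        def replay(logits, counts, penalty):
            # Penalty path was never captured: `penalty` is silently ignored.
            return list(logits)
    return replay


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=lambda i: xs[i])


# Warmup request has no frequency penalty, so the "graph" lacks the count path.
graph = capture_sampler(warmup_penalty=0.0)

logits = [1.0, 0.5, 2.0, 0.0]
counts = [0, 0, 3, 0]  # token 2 has already been generated three times
print(argmax(graph(logits, counts, 0.5)))  # 2: repetition is not penalized

eager = [l - 0.5 * c for l, c in zip(logits, counts)]
print(argmax(eager))  # 0: token 2 drops to 2.0 - 1.5 = 0.5 < 1.0
```

The replayed result differs from the eager result even though the request carries the same `penalty`, which matches the symptom described above.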

Add speculative logits penalty CUDA ops, preserve sequence-slot count state across CUDA graph metadata/replay, append accepted tokens back into count state, and gate forced graph count capture to the disaggregated generation role by default.
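A rough sketch of the state handling this commit describes, under stated assumptions: counts live in a buffer indexed by sequence slot outside the captured graph, so each replay reads the current counts; only tokens accepted by speculative verification are appended; and forced count-path capture defaults on only for the disaggregated generation role. `SlotTokenCounts`, `penalize`, `force_count_capture`, and the role string `"disagg_generation"` are hypothetical names, not the PR's actual API.

```python
from typing import List, Optional, Sequence


class SlotTokenCounts:
    """Hypothetical per-sequence-slot generated-token counts.

    The buffer lives outside the captured graph, so a replay reads the
    counts the slot has accumulated so far rather than capture-time values.
    """

    def __init__(self, num_slots: int, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.counts: List[List[int]] = [[0] * vocab_size for _ in range(num_slots)]

    def append_accepted(self, slot: int, accepted: Sequence[int]) -> None:
        # Only tokens accepted by speculative verification are counted;
        # rejected draft tokens never become part of the output.
        for tok in accepted:
            self.counts[slot][tok] += 1

    def reset(self, slot: int) -> None:
        # Clear the slot when it is reassigned to a new request.
        self.counts[slot] = [0] * self.vocab_size


def penalize(logits: Sequence[float], counts: Sequence[int], frequency_penalty: float) -> List[float]:
    # Assumed OpenAI-style frequency penalty: subtract penalty * count.
    return [l - frequency_penalty * c for l, c in zip(logits, counts)]


def force_count_capture(role: str, override: Optional[bool] = None) -> bool:
    # Hypothetical gating: an explicit override wins; otherwise only the
    # disaggregated generation role forces capture of the count path.
    if override is not None:
        return override
    return role == "disagg_generation"


state = SlotTokenCounts(num_slots=2, vocab_size=4)
state.append_accepted(slot=1, accepted=[2, 2])  # step 1: two accepted tokens
state.append_accepted(slot=1, accepted=[2])     # step 2: one more
print(state.counts[1])                          # [0, 0, 3, 0]
print(penalize([1.0, 0.5, 2.0, 0.0], state.counts[1], 0.5))  # [1.0, 0.5, 0.5, 0.0]
print(force_count_capture("disagg_generation"), force_count_capture("context"))  # True False
```

Defaulting the forced capture to one role keeps the extra count-path kernels out of graphs that, by assumption here, do not need them, while an override leaves room to enable it elsewhere.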

Validation: python3 -m py_compile on the modified Python modules; git diff --cached --check; 8 auto-gating runs in total with the original NVBug GPT-OSS disagg config, with >10k=0 and 16K/length=0.
chuangz0 requested review from a team as code owners June 10, 2026 08:41
chuangz0 requested review from byshiue and cascade812 and removed request for a team June 10, 2026 08:41